Persian Word Sense Disambiguation Corpus Extraction Based on Web Crawler Method
Abstract
Finding an appropriate dataset for natural language processing applications is one of the main challenges for researchers in this field. The problem is more acute for languages written in non-Latin scripts, especially Persian. Access to an appropriate dataset that can be used to develop practical language processing programs helps validate the obtained results and makes it possible to compare and precisely analyze research studies in the field. This paper presents a procedure for extracting a standard dataset in Persian. The dataset is intended solely for research on word-sense disambiguation in Persian. The documents that contain the ambiguous words of interest are collected by a web crawler; these words are then processed and registered in a Persian dataset of ambiguous words. In this research, three common ambiguous Persian words are used to extract suitable phrases that contain them. Finally, a framework for creating a proper configuration for use in word-sense disambiguation problems is presented. This method offers a solution to the absence of a suitable word sense disambiguation corpus for Persian.
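The phrase-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the crawled pages are already available as plain-text strings, uses a naive regex tokenizer (which would split Persian words joined by a zero-width non-joiner), and uses "شیر" (milk / lion / faucet) as a hypothetical target word, since the abstract does not name the three words actually used. The names `extract_contexts`, `TARGET_WORDS`, and `window` are likewise illustrative.

```python
import re
from collections import defaultdict

# Hypothetical target list: "شیر" is a well-known Persian homograph
# (milk / lion / faucet). The three words used in the paper are not
# named in the abstract, so this choice is an assumption.
TARGET_WORDS = {"شیر"}

# Naive tokenizer: runs of Unicode word characters.
TOKEN_RE = re.compile(r"\w+")


def extract_contexts(documents, targets, window=3):
    """Map each target word to the context windows in which it occurs.

    Each context holds the target token plus up to `window` tokens
    on either side, joined by single spaces.
    """
    contexts = defaultdict(list)
    for text in documents:
        tokens = TOKEN_RE.findall(text)
        for i, tok in enumerate(tokens):
            if tok in targets:
                start = max(0, i - window)
                contexts[tok].append(" ".join(tokens[start:i + window + 1]))
    return dict(contexts)


# Stand-ins for text returned by a crawler.
docs = [
    "او هر روز صبح یک لیوان شیر گرم می نوشد",
    "در باغ وحش یک شیر بزرگ دیدیم",
    "این جمله واژه مبهم ندارد",
]
corpus = extract_contexts(docs, TARGET_WORDS, window=2)
for word, phrases in corpus.items():
    for phrase in phrases:
        print(word, "|", phrase)
```

Each extracted phrase would still need a sense label before it could serve as a disambiguation corpus entry; that annotation step is outside this sketch.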
Similar resources
Word Sense Disambiguation by Web mining for word co-occurrence probabilities
This paper describes the National Research Council (NRC) Word Sense Disambiguation (WSD) system, as applied to the English Lexical Sample (ELS) task in Senseval-3. The NRC system approaches WSD as a classical supervised machine learning problem, using familiar tools such as the Weka machine learning software and Brill’s rule-based part-of-speech tagger. Head words are represented as feature vec...
Improving the Collocation Extraction Method Using an Untagged Corpus for Persian Word Sense Disambiguation
Word sense disambiguation is used in many natural language processing fields. One of the ways of disambiguation is the use of decision list algorithm which is a supervised method. Supervised methods are considered as the most accurate machine learning algorithms but they are strongly influenced by knowledge acquisition bottleneck which means that their efficiency depends on the size of the tagg...
Cross-Lingual Word Sense Disambiguation for Languages with Scarce Resources
Word Sense Disambiguation has long been a central problem in computational linguistics. Word Sense Disambiguation is the ability to identify the meaning of words in context in a computational manner. Statistical and supervised approaches require a large amount of labeled resources as training datasets. In contradistinction to English, the Persian language has neither any semantically tagged cor...
Improving translation accuracy in web-based translation extraction
In this paper, we present some approaches to improve translation accuracy in web-based translation extraction. In previous work, the term extraction techniques that researchers used are proposed under large static corpus. We proposed some approaches that can improve the translation accuracy in web-based translation extraction which relies on small dynamic small corpus. We also analyzed the diff...
Unsupervised Part of Speech Tagging for Persian
In this paper we present a rather novel unsupervised method for part of speech (below POS) disambiguation which has been applied to Persian. This method known as Iterative Improved Feedback (IIF) Model, which is a heuristic one, uses only a raw corpus of Persian as well as all possible tags for every word in that corpus as input. During the process of tagging, the algorithm passes through sever...